Optical Character Recognition by Open source OCR Tool Tesseract: A Case Study

نویسندگان

  • Chirag Patel
  • Atul Patel
  • Dharmendra Patel
  • ARCHANA A. SHINDE
  • HUI WU
  • Jian Liang
چکیده

Optical character recognition (OCR) method has been used in converting printed text into editable text. OCR is very useful and popular method in various applications. Accuracy of OCR can be dependent on text preprocessing and segmentation algorithms. Sometimes it is difficult to retrieve text from the image because of different size, style, orientation, complex background of image etc. We begin this paper with an introduction of Optical Character Recognition (OCR) method, History of Open Source OCR tool Tesseract, architecture of it and experiment result of OCR performed by Tesseract on different kinds images are discussed. We conclude this paper by comparative study of this tool with other commercial OCR tool Transym OCR by considering vehicle number plate as input. From vehicle number plate we tried to extract vehicle number by using Tesseract and Transym and compared these tools based on various parameters.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Recognition of Handwritten Roman Script Using Tesseract Open source OCR Engine

In the present work, we have used Tesseract 2.01 open source Optical Character Recognition (OCR) Engine under Apache License 2.0 for recognition of handwriting samples of lower case Roman script. Handwritten isolated and free-flow text samples were collected from multiple users. Tesseract is trained to recognize user-specific handwriting samples of both the categories of document pages. On a si...

متن کامل

Recognition of handwritten Roman Numerals using Tesseract open source OCR engine

The objective of the paper is to recognize handwritten samples of Roman numerals using Tesseract open source Optical Character Recognition (OCR) engine. Tesseract is trained with data samples of different persons to generate one user-independent language model, representing the handwritten Roman digit-set. The system is trained with 1226 digit samples collected form the different users. The per...

متن کامل

Recognition of Handwritten Textual Annotations using Tesseract Open Source OCR Engine for information Just In Time (iJIT)

Objective of the current work is to develop an Optical Character Recognition (OCR) engine for information Just In Time (iJIT) system that can be used for recognition of handwritten textual annotations of lower case Roman script. Tesseract open source OCR engine under Apache License 2.0 is used to develop user-specific handwriting recognition models, viz., the language sets, for the said system,...

متن کامل

Development of a multi-user handwriting recognition system using Tesseract open source OCR engine

The objective of the paper is to recognize handwritten samples of lower case Roman script using Tesseract open source Optical Character Recognition (OCR) engine under Apache License 2.0. Handwritten data samples containing isolated and free-flow text were collected from different users. Tesseract is trained with user-specific data samples of both the categories of document pages to generate sep...

متن کامل

Design of an Optical Character Recognition System for Camera-based Handheld Devices

This paper presents a complete Optical Character Recognition (OCR) system for camera captured image/graphics embedded textual documents for handheld devices. At first, text regions are extracted and skew corrected. Then, these regions are binarized and segmented into lines and characters. Characters are passed into the recognition module. Experimenting with a set of 100 business card images, ca...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2012